Activation extending based on long-range dependencies for weakly supervised semantic segmentation

Weakly supervised semantic segmentation (WSSS) principally obtains pseudo-labels from class activation maps (CAM) to avoid expensive pixel-level annotation. However, CAM is prone to false and local activation due to the lack of annotation information. This paper treats weakly supervised learning as semantic information mining to extend the object mask. We propose a novel architecture that mines semantic information by modeling long-range dependencies both within and across samples. To limit the confusion caused by long-range dependencies, each image is divided into blocks, and self-attention is carried out within each block, where fewer classes are present, to obtain long-range dependencies while reducing false predictions. Moreover, we perform global-to-local weighted self-supervised contrastive learning among image blocks, so that the local activation of CAM is transferred to different foreground areas. Experiments verify that the proposed modules capture superior semantic details and more reliable pseudo-labels. On PASCAL VOC 2012, the proposed model achieves 76.6% and 77.4% mIoU on the val and test sets, which is superior to the comparison baselines.


Introduction
Semantic segmentation [1][2][3] generates a regional mask containing semantic information for an input image. It has been widely used in fields such as medical imaging analysis [4], automatic driving [5], and UAV applications [6]. However, most existing methods still rely on manually annotated pixel-level labels, which are hugely resource-consuming. In recent years, researchers have turned to semantic segmentation with semi-supervision [7], no supervision [8], and weak supervision. Weakly supervised methods offer lower annotation costs than semi-supervised methods and better performance than unsupervised methods, and are well suited to semantic segmentation tasks [9]. They use cheaper annotations as the supervisory information of the backbone networks, for instance image-level classification labels [10,11], scribbles [12], and bounding boxes [13]. These methods effectively reduce the implementation cost of this vision task. However, the lack of annotation information requires the model to discriminate the edge and shape of the object more finely. This paper focuses on generating pseudo-labels from image-level classification labels through long-range dependencies between pixels.
In the WSSS setting, most schemes extract practical information from labels weaker than pixel-level labels [14], and convert weak labels containing almost no object position information into image segmentation masks [10,15]. Class activation mapping (CAM) [16] is an effective solution for generating pixel-level pseudo-labels from image-level classification labels. However, because of the discriminative mode of the classifiers [17,18] and the limited spatial detail contained in these labels [19], CAM often produces local activation regions [20], and the segmented object boundaries easily involve false activation, which causes masks that are fragmentary to varying degrees [21]. Much recent work has refined the quality of CAM by mining more semantic and object location information from the limited annotation information [10,11,18]. The success of these methods depends on the long-range dependencies [22] between pixels in an image. Modeling long-range dependencies can effectively improve the scene understanding ability of deep neural networks [23]. Still, these methods often stack convolution operations to enlarge the receptive field and thereby capture this relationship [24]. Such repeated local operations make the computational complexity of the network too high, which is not conducive to network optimization [25]. As a non-local means operation, the self-attention mechanism [13] can calculate the correlation of elements at different spatial locations [26], and [27,28] adopt self-attention to capture long-range dependencies and improve the prediction ability of CAM. However, these methods still have two drawbacks: (1) the long-range dependencies can mislead the image-level classification model into learning false correlations between pixels and labels [29]; (2) these methods ignore the rich long-range dependencies between image samples [30]. Taking Fig 1 as an example, when classification is the goal, the correct classification of different classes benefits from their context, but in segmentation tasks this dependency is overemphasized, and the inter-pixel causal intervention [31] makes it difficult for CAM's prediction to distinguish the boundaries. The lack of labeling information in pseudo-labels is the main reason for the performance gap between weakly and fully supervised models [32,33]. How to establish long-range dependencies between objects of the same class in different samples is therefore also a main point of this paper.
According to the above, the bottleneck lies in how to mine more semantic information effectively while avoiding the information confusion caused by long-range dependencies, so as to generate high-quality CAM. This paper formulates two novel modules to address these difficulties. The first is a modified self-attention module [26], inspired by the block method of PuzzleCam [34], which carries out feature extraction on a smaller area and explicitly cuts off the similarity calculation between easily confused classes in the sample. This not only provides a kind of data augmentation for network training, but also provides richer and more accurate samples for the subsequent contrastive learning. The second module is a foreground feature contrast based on cross-image analysis. It leverages existing feature information to enable pixel-level self-supervised contrastive learning without negative samples. It strengthens the relationships between similar foregrounds in different samples, and the loss is calculated with rank weights, which reduces the interference between different classes. To sum up, this paper makes the following contributions:
• We propose semantic mining to compensate for the lack of annotation information in WSSS, modeling long-range dependencies within and across samples, globally and locally, which narrows the gap between weak and full supervision.
• The proposed region self-attention module (RSA) calculates the correlation among pixels within a given sample area, using a modified non-local self-attention to mitigate the information confusion caused by causal intervention and to reduce false activations of CAM.
• The proposed cross-image contrast module (CFC) employs global and local weighting in contrastive learning. It reframes foreground features as positive samples and minimizes the feature distance between samples of the same class, which effectively extends the local activation of CAM to the entire target area.
• Our approach does not refine the CAM via additional networks, and it achieves 76.6% and 77.4% mIoU on the val and test sets of PASCAL VOC 2012 and 43.8% mIoU on the val set of MS COCO.

Weakly supervised semantic segmentation
The strategies for pixel-level pseudo-label generation based on semantic information mining can be divided into region mining and cross-image mining. The region mining strategy [11,26,35] focuses on the pixel correlation within a single image. [36] drives the classification network to discover new and supplementary target regions sequentially by erasing the currently mined areas in an adversarial manner. This approach also essentially breaks down the causal interference between pixels brought about by long-range dependencies. The scheme in [20] provides recalibration supervision for the CAM and, to some extent, solves the CAM over-activation problem. SEAM [32] suggests a self-supervised equivariant attention mechanism to narrow the gap between weak and full supervision. Some studies have also explained CAM generation from a new perspective, such as causal reasoning, information bottleneck theory [37], and anti-adversarial attack [14]. However, these methods fail to take advantage of the rich long-range dependencies between samples. There are also methods that refine CAM by mining semantic information across images [9,38]. SUN [39] incorporates two neural co-attentions into the classifier to capture cross-image semantic similarities and differences complementarily. CIAN [33] suggests a cross-image attention module that learns activation maps from two images containing objects of the same class under the guidance of saliency maps. But these methods require additional data annotation. CCAM [30] generates a class-agnostic activation map through contrastive learning as a background cue for CAM, but this approach requires an additional classifier to generate the class-agnostic activation map. Our approach is based on both cross-image and region mining, and adds no extra convolution modules or annotated datasets.

Non-local self-attention operation
In deep neural networks, convolution and recurrent operations process a local neighborhood in space or time. Therefore, long-range dependencies can only be captured when these operations are applied repeatedly. The self-attention module can compute the response at a location by attending to all locations and taking their weighted average in the embedding space [26]. [27] integrate class-agnostic saliency priors into the self-attention mechanism and use class-specific attention cues as additional supervision. [40] propose an unbiased self-attention learning segmentation network, which designs unbiased layers to guide the network to expand the discriminative field of CAM during training. [41] propose an edge-based self-attention mechanism to strengthen the segmentation of nodule edges. Influenced by the self-attention mechanism of ViT [24], SEAM [32] introduces a pixel correlation module (PCM) that captures contextual appearance information for each pixel and modifies the original CAM with a learned affinity attention map. [26] proposed a non-local operation that calculates the response at one location as a weighted sum of the features at all locations, as a means of image information mining that captures long-distance context relations. But essentially, context is a confounder that creates false causal interventions between pixels [29].
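For reference, the generic non-local operation of [26] is usually written in the following form. The symbols here follow the common presentation of that operation and are not the notation of this paper: x is the input feature map, i indexes the output position, j enumerates all positions, f is a pairwise similarity function, g computes a representation of the input at position j, and C(x) is a normalization factor.

```latex
y_i \;=\; \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)
```

Because the sum runs over every position j, a single layer relates each pixel to all other pixels, which is what allows long-range dependencies to be captured without stacking local operations.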

Class activation mapping (CAM)
In this section, we describe the procedure for generating class activation maps (CAM) [27]. CAM shows the distribution of contributions to classification over the original image. The process of the method in this paper is shown in Fig 2. Given a training set of images X, the Chunk module splits each image into blocks for step 1, while step 2 is performed on the original-size image X at the same time. The output feature maps of the last convolutional layer, generated through the encoder Γ(·), are denoted Ŷ for step 1 and Y for step 2. A fully connected layer Λ(·) with parameter A ∈ R^{D×G} is then used to retrieve the classification scores, where G is the number of classes. The prediction score of the two steps for class p is given in Eq 1, where Ŷ_{x̂,ŷ,z} and Y_{x,y,z} represent the activations of Ŷ and Y at spatial locations (x̂, ŷ) and (x, y) on channel z. We generate the CAM for the two steps by a weighted linear sum of visual patterns at different spatial locations. This process is described in Eq 2, where Ĉ_p and C_p represent the CAM of class p. The CAMs Ĉ and C for all classes are obtained by concatenating Ĉ_p and C_p, and the activation function ReLU is then applied to Ĉ and C to mask irrelevant pixels and obtain the final visual version of the CAM, as shown in Fig 2. It is worth mentioning that global average pooling can be applied to the CAM in practice to obtain a vector of classification prediction scores for all classes, which is equivalent to the set of all class prediction scores in Eq 1.
CAMs can be used as initial seed regions for pseudo-labels [32]. As the procedure above shows, such a CAM is driven by the classification task, so it is often limited to the areas with higher classification prediction scores [20], which is unfavorable for our pixel-level segmentation task. Two modules are proposed to obtain more reasonable CAM prediction masks, and they are elaborated on in the next section.
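The equivalence mentioned above, between pooling the CAM and scoring the pooled features, can be checked numerically. The following is a minimal sketch of standard CAM [16] in plain Python, not the authors' implementation: the function names and the toy 2×2 feature map are illustrative assumptions, and the fully connected layer is taken to be a bias-free linear map with a D×G weight matrix.

```python
def raw_cam(features, weights):
    """Weighted linear sum over channels.

    features: H x W x D nested lists (last convolutional feature map).
    weights:  D x G nested lists (classifier weights, no bias assumed).
    Returns an H x W x G map: cam[x][y][p] = sum_z weights[z][p] * features[x][y][z].
    """
    num_classes = len(weights[0])
    return [[[sum(weights[z][p] * pixel[z] for z in range(len(pixel)))
              for p in range(num_classes)]
             for pixel in row]
            for row in features]


def relu_cam(cam):
    """Mask negative (irrelevant) activations to get the visual version of the CAM."""
    return [[[max(0.0, v) for v in pixel] for pixel in row] for row in cam]


def global_average_pool(maps):
    """Average an H x W x K map over its spatial positions, giving a length-K vector."""
    h, w = len(maps), len(maps[0])
    return [sum(maps[x][y][k] for x in range(h) for y in range(w)) / (h * w)
            for k in range(len(maps[0][0]))]


def class_scores(features, weights):
    """Classification scores: pool the features first, then apply the linear layer."""
    pooled = global_average_pool(features)
    return [sum(weights[z][p] * pooled[z] for z in range(len(pooled)))
            for p in range(len(weights[0]))]


features = [[[1.0, 2.0], [0.0, 1.0]],
            [[3.0, -1.0], [2.0, 2.0]]]          # H = W = 2, D = 2
weights = [[1.0, 0.0, -1.0],
           [0.5, 1.0, 2.0]]                      # D = 2, G = 3

scores_from_features = class_scores(features, weights)
scores_from_cam = global_average_pool(raw_cam(features, weights))
```

The two score vectors agree because both pooling and the classifier are linear. The equivalence holds for the CAM before ReLU; the ReLU step is only used to obtain the displayed map.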

Regional self-attention module (RSA)
RSA is a module that captures contextual information and optimizes pixel-level prediction results. The detailed design of the proposed RSA module is shown in Fig 3. We improve the self-attention module proposed by [26]. The classical self-attention module given in that method is calculated as in Eq 3, where C_{x,y,z} and C*_{x,y,z} represent the original CAM and the modified CAM at spatial position (x, y) on channel z, and the functions θ, ϕ, and φ denote three separate 1×1 convolution operations. The CAM is optimized by computing the similarity dot product between the activations Y_{x,y,z} and Y_{i,j,z} of Y at spatial locations (x, y) and (i, j) on channel z, and g(Y) represents the normalization factor. We modify this module in two ways. First, redundant convolutional layers and residual connections are removed to reduce the number of parameters. Second, RSA is applied to the extracted region features, that is, pixel correlation prediction is carried out within the region.
This strategy has several benefits: the region features have a larger field of attention and fewer categories, so the causal interference between pixels can be reduced, which helps the CAM cover the objects of interest correctly. The similarity matrix of the regional features is then calculated to weight the original CAM. This process is described in Eqs 4 to 6, where C*_{x,y,z} is the final modified version of the CAM at spatial position (x, y) on channel z, normalized by g(Ŷ). In Eq 6, Ŷ^T_{x,y,z} · Ŷ_{i,j,z} represents the similarity dot product of region features, and the activation function ReLU is used in f(Ŷ_{x,y,z}, Ŷ_{i,j,z}) to mask irrelevant pixels. After obtaining the region refinement set of CAMs, the blocks are merged back to the original image size to give the merged set, where ĉ*_{re,i} ∈ R^{H×W×G}. The global average pooling layer (h_GAP) is used to obtain the prediction vectors for calculating the classification loss, as in Eq 7, where V̂ and V represent the prediction vectors obtained from C*_re and C through the global average pooling layer.

In this study, we only use image-level classification labels to generate pixel-level pseudo-labels. The multi-label classification loss is used to calculate the classification loss, as shown in Eq 8, where G represents the number of classes. The two-step classification loss is shown in Eq 9, where l represents the classification label, which is the only annotation we use. During training, L_cls is used simultaneously as the supervision signal for the two steps to improve the performance of the classification network. Through the RSA module, we obtain the refined CAM calculated from the pixel correlation within the region. The refined regional CAM is used as the foreground contrast sample of the CFC module, which improves the quality and quantity of the samples so that contrastive learning can be carried out better.
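The region-level refinement described above can be sketched as follows. This is one plausible reading of the text rather than the authors' code: it assumes the similarity f is a ReLU-masked dot product between pixel features of one block, that g is the row sum of those similarities, and that no learned 1×1 convolutions are involved. The function name `region_self_attention` and the small `eps` stabilizer are hypothetical.

```python
def region_self_attention(features, cam, eps=1e-8):
    """Refine the CAM of one image block with pixel-to-pixel similarity.

    features: list of MN pixel feature vectors (each of length D) from the block.
    cam:      list of MN class activation vectors (each of length G) for the same pixels.
    Returns a list of MN refined activation vectors.
    """
    n = len(features)
    num_classes = len(cam[0])
    refined = []
    for i in range(n):
        # ReLU-masked dot-product similarity between pixel i and every pixel j
        sims = [max(0.0, sum(a * b for a, b in zip(features[i], features[j])))
                for j in range(n)]
        norm = sum(sims) + eps  # normalization factor (assumed to be the row sum)
        # each refined activation is a similarity-weighted average of the block's CAM
        refined.append([sum(sims[j] * cam[j][p] for j in range(n)) / norm
                        for p in range(num_classes)])
    return refined


# Toy block with three pixels: pixels 0 and 1 look alike, pixel 2 is different.
block_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
block_cam = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]   # only pixel 0 is activated for class 0
refined_cam = region_self_attention(block_features, block_cam)
```

In this toy block, the class 0 activation of pixel 0 is shared with the similar pixel 1, while the dissimilar pixel 2 keeps its own class 1 activation. Restricting the computation to one block means that pixels outside the block cannot contribute, which is how the block structure cuts off similarity calculation between regions.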

Cross-image foreground feature comparison module (CFC)
The proposed CFC module, as depicted in Fig 3, aims to enhance the location accuracy of CAMs by leveraging contrastive learning to mine long-range dependencies across different samples. The block features ŷ_i ∈ R^{M×N×D×4} and the refined CAMs ĉ*_i ∈ R^{M×N×G×4} extracted from step 1 are used to construct the foreground vector f_i (Eq 10), which represents the foreground vectors of the image blocks. We use them as positive samples for contrastive learning. To enhance the network's robustness to batch size and facilitate its learning process [42], negative sampling is not employed. The rank weights between different sample pairs are then calculated from the global cosine similarity between the foreground vectors in the batch. That is, pairs that are semantically similar (in appearance, color, or texture) are given more weight, and less similar pairs are given less weight, which reduces the information confusion between different classes. Our approach differs from [30] in two ways. First, we directly apply contrastive learning to multi-channel CAMs, using it as a self-supervisory signal and refining the CAMs through the loss during training, without introducing additional networks for generating comparison samples. Second, our approach focuses on both global and local similarity, taking into account not only the overall resemblance between images but also the similarities between individual image blocks. S^{(j),(s)}_{i,r} represents the local similarity between the foreground vectors f^{(j)}_i and f^{(s)}_r. Collecting these values gives the similarity matrix S_{i,r} ∈ R^{4×4} between the image patches of two different images, which contains the similarity scores corresponding to all locations in f_i and f_r. The global weight W_{i,r} between f_i and f_r is defined in Eq 13, where α is the weight index used to control function smoothing, x and y are location indices, and the range of W_{i,r} is between 0 and 1. The cross-image foreground feature vector contrastive loss is shown in Eq 14. It serves as auxiliary supervision for the CAMs generated in step 1: we shrink their feature distance in a self-supervised form during training. To prevent foregrounds of different classes from being confused during training, W_{i,r} is adopted to calculate the loss. In Eq 14, the indicator I_{i≠r} ∈ {0, 1} is equal to 0 if i = r, L^B_pos represents the contrastive loss within batch B with batch size b, and L_pos = Σ_{B=1}^{U} L^B_pos, where U is the total number of batches. Our ultimate aim is to guide the optimization of the C generated in step 2 through the Ĉ* generated in step 1 during training. To make the two steps achieve equivariance learning during training, Eq 15 follows the reconstruction regularization proposed by [32,34]. It is the reconstruction loss for the original CAM, where Ĉ*_re represents the merged version of Ĉ*. The total loss is given in Eq 16, where λ and β are the weight coefficients. During training, the model's parameters are updated for T iterations until the model is fitted, thus expanding the activated region of the CAM to the actual target region. The gradient is backpropagated from the classifier Λ(·) to the feature extractor Γ(·), and we denote the parameters of these two parts as υ.
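The weighted, positive-only contrast described above can be illustrated with a small sketch. Only the overall recipe comes from the text: foreground vectors formed by multiplying the refined CAM with the features (Fig 3), cosine similarity between blocks, a global weight in [0, 1] smoothed by α, and no negative pairs. The exact forms used here are assumptions, in particular taking the weight as the clipped mean block similarity raised to the power α and the distance as one minus the cosine similarity. All function names are hypothetical.

```python
import math


def foreground_vector(features, cam):
    """Flattened G*D foreground vector of one block, computed as cam^T @ features.

    features: MN x D nested lists; cam: MN x G nested lists.
    """
    d, g = len(features[0]), len(cam[0])
    return [sum(cam[k][p] * features[k][z] for k in range(len(features)))
            for p in range(g) for z in range(d)]


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small stabilizer in the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def weighted_positive_loss(batch, alpha=0.25):
    """batch: list of images, each given as a list of block foreground vectors."""
    total, pairs = 0.0, 0
    for i, blocks_i in enumerate(batch):
        for r, blocks_r in enumerate(batch):
            if i == r:  # indicator: an image is never contrasted with itself
                continue
            # local similarity matrix between all blocks of image i and image r
            local = [[cosine(u, v) for v in blocks_r] for u in blocks_i]
            count = len(blocks_i) * len(blocks_r)
            mean_sim = sum(sum(row) for row in local) / count
            # assumed global weight: clipped mean similarity, smoothed by alpha
            weight = max(0.0, min(1.0, mean_sim)) ** alpha
            # pull similar foregrounds together, with no negative samples
            total += weight * sum(1.0 - s for row in local for s in row)
            pairs += count
    return total / pairs if pairs else 0.0


vec = foreground_vector([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
loss_same = weighted_positive_loss([[[1.0, 0.0]], [[1.0, 0.0]]])
loss_unrelated = weighted_positive_loss([[[1.0, 0.0]], [[0.0, 1.0]]])
loss_partial = weighted_positive_loss([[[1.0, 0.0]], [[1.0, 1.0]]])
```

With this form, identical foregrounds give a loss close to zero, unrelated foregrounds receive zero weight and are ignored, and partially similar foregrounds produce a positive loss that pulls them together. The value α = 0.25 matches the setting reported in the training details.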
Algorithm 1: The training process
Input: The training set of images X
Output: The set of CAMs
Λ(Y);
5 Ĉ* ← Extend Ĉ via Eq 4;
6 Extend Ĉ* via Eq 14;
Extend C via Eq 15;
9 Update υ via Eq 16;
10 end

For testing, the testing set of images is defined as K = {k_i}_{i=1}^{R}. We use the network steps without the RSA and CFC modules to generate CAMs for each image.
Algorithm 2: The testing process
Input: The testing set of images K
Output: The set of CAMs
1 Initialize: the parameters υ of Λ(·) and Γ(·) are loaded from the pre-trained model;
Γ(X);
Λ(Y)
Following the above steps, we obtain a CAM generation model trained with source samples and image-level classification labels, after which the two conventional post-processing steps are applied: (1) CAM regions are selected as seed regions by a threshold [11]; (2) the seed regions are expanded into the final pseudo-labels [18]. The visualization results are shown in Experimental Results and Discussion. From the structure of the method, it is easy to see that the RSA and CFC modules complement each other: the RSA module provides more abundant high-quality samples for the CFC module, and the CFC module guides the CAM generation of the RSA module through its loss, which we also demonstrate in our experiments.
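Post-processing step (1) can be sketched as below. This is a common convention for turning CAMs into seed labels, not necessarily the exact procedure of [11]: each class map is normalized by its own peak, each pixel takes the strongest class among those present in the image-level label, and pixels below the threshold become background. The function name and the threshold value of 0.3 are illustrative assumptions.

```python
def seeds_from_cam(cam, image_labels, threshold=0.3):
    """Turn a CAM into a seed label map.

    cam:          H x W x G nested lists of non-negative activations.
    image_labels: set of class indices present in the image-level label.
    Returns an H x W map where 0 is background and p + 1 is class p.
    """
    height, width = len(cam), len(cam[0])
    labels = sorted(image_labels)
    if not labels:
        return [[0] * width for _ in range(height)]
    # normalize each class map by its own maximum so classes are comparable
    peak = {}
    for p in labels:
        top = max(cam[x][y][p] for x in range(height) for y in range(width))
        peak[p] = top if top > 0 else 1.0
    seeds = []
    for x in range(height):
        row = []
        for y in range(width):
            scores = [(cam[x][y][p] / peak[p], p) for p in labels]
            best_score, best_class = max(scores, key=lambda item: item[0])
            row.append(best_class + 1 if best_score >= threshold else 0)
        seeds.append(row)
    return seeds


# One row of three pixels, two classes present in the image-level label.
toy_cam = [[[4.0, 0.0], [0.4, 1.0], [0.4, 0.1]]]
toy_seeds = seeds_from_cam(toy_cam, {0, 1})
```

Here the first pixel becomes a seed for class 0, the second for class 1, and the third falls below the threshold for both classes and is left as background for the expansion step to resolve.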

Implementation details
Datasets. PASCAL VOC 2012 [43] is currently the most widely used natural scene image dataset for weakly supervised semantic segmentation based on image-level labels. In the weakly supervised setting, image-level labels are used for pseudo-mask generation and pixel-level labels for validating the segmentation results. We train on the augmented training set of 10,582 images, with 1,449 images for validation and 1,456 for testing. The MS COCO 2014 dataset [44] consists of 80 classes, with 82,783 and 40,504 images for training and validation. In all experiments, the image is randomly scaled in the range [320, 640] and then cropped to 512 × 512 as the network input.
Evaluation index. The mean Intersection over Union (mIoU) [45] is used as the overall performance evaluation index for both pseudo-label generation and segmentation. The mean false discovery rate (mFDR) and mean false negative rate (mFNR) are used as evaluation indexes of the CAM's prediction performance. Specifically, when the CAM covers more of the object target area, the mFNR is smaller; when the CAM has fewer false activations, the mFDR is smaller [19,32]. In these metrics, TP_p denotes the number of true positive pixels of class p, and FP_p and FN_p denote the numbers of false positive and false negative pixels of class p.
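The three indexes can be computed from per-class pixel counts as sketched below, assuming the usual definitions: IoU_p = TP_p / (TP_p + FP_p + FN_p), FDR_p = FP_p / (TP_p + FP_p), and FNR_p = FN_p / (TP_p + FN_p), each averaged over classes. The function names are hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is a convention chosen for this sketch.

```python
def confusion_counts(pred, truth, num_classes):
    """Per-class true positive, false positive, and false negative pixel counts."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                tp[p] += 1
            else:
                fp[p] += 1   # pixel wrongly assigned to class p
                fn[t] += 1   # pixel of class t that was missed
    return tp, fp, fn


def mean_metrics(tp, fp, fn):
    """Return (mIoU, mFDR, mFNR) averaged over classes that occur at least once."""
    ious, fdrs, fnrs = [], [], []
    for t, p, n in zip(tp, fp, fn):
        if t + p + n == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(t / (t + p + n))
        fdrs.append(p / (t + p) if t + p else 0.0)
        fnrs.append(n / (t + n) if t + n else 0.0)
    k = len(ious)
    return sum(ious) / k, sum(fdrs) / k, sum(fnrs) / k


pred = [[0, 1],
        [1, 1]]
truth = [[0, 1],
         [0, 1]]
tp, fp, fn = confusion_counts(pred, truth, num_classes=2)
miou, mfdr, mfnr = mean_metrics(tp, fp, fn)
```

In this toy example one background pixel is predicted as class 1, so class 1 gains a false positive (raising mFDR) and class 0 gains a false negative (raising mFNR), while both effects lower mIoU.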

Comparison experiment
Training details. In this study, the experimental hardware is a 15 vCPU Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz and one A100-SXM4-80GB (80GB) GPU. The initial learning rate of the generator is 0.01, the batch size is 32, and the maximum iteration number T is 4.5k. The comparative experiments on pseudo-label quality against previous methods were all performed on the PASCAL VOC 2012 dataset.
The proposed method uses PuzzleCam as the baseline and analyzes two backbone networks, ResNet50 and ResNeSt101. Based on verification on the PASCAL VOC 2012 training dataset, Table 1 shows the mIoU of the CAMs generated by the proposed method for different combinations of λ and β. L_re learns the difference between the regions of interest of the full and segmented images, and L_pos brings the feature representations of similar foreground classes closer together during training. During training, we found that L_pos is more sensitive than L_re, so we vary the parameter for L_pos in smaller steps to select an appropriate λ, and set α = 0.25. As can be seen from the table, the generated CAMs achieve the highest mIoU when λ = 0.5 and β = 2, and the subsequent experiments are carried out under this parameter setting.
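One reading of how the three loss terms are combined, consistent with the description of λ and β above, is sketched below. The assignment of λ to L_pos and β to L_re is an assumption inferred from the tuning discussion, and the unit weight on L_cls is also assumed; the summation over batches is taken from the definition of L_pos in the CFC section.

```latex
\mathcal{L}_{total} \;=\; \mathcal{L}_{cls} \;+\; \lambda\, \mathcal{L}_{pos} \;+\; \beta\, \mathcal{L}_{re},
\qquad
\mathcal{L}_{pos} \;=\; \sum_{B=1}^{U} \mathcal{L}^{B}_{pos}
```

Under this reading, the reported setting λ = 0.5 and β = 2 would down-weight the more sensitive contrastive term relative to the reconstruction term.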
Comparison with baseline. The quality of the pseudo-labels determines the performance of weakly supervised semantic segmentation networks. Table 2 shows that, under different backbones, adding the two proposed modules RSA and CFC increases the mIoU of the generated pseudo-labels by 5.22% and 3.43% relative to PuzzleCam, which shows that our semantic information mining network effectively improves pseudo-label quality.
The pseudo-label visualization results are shown in Fig 4. It can be seen that, whether the scene contains one class or several, the pseudo-labels generated by our method have a more accurate prediction range. Taking the first column as an example, the self-supervised signal introduced by our CFC module gives a more accurate prediction of the foreground and background than PuzzleCam. For example, in PASCAL VOC 2012, because persons and motorcycles often appear in the same scene, PuzzleCam predicts a false correlation between them, resulting in boundaries that are difficult to distinguish between the classes. Our approach benefits from guiding the activation of the CAM by pixel correlations at a regional scale, effectively reducing erroneous causal intervention between pixels.
The segmentation result is one of the criteria for measuring the quality of the pseudo-labels. To further prove the effectiveness of the proposed method, we use the pseudo-labels generated by our ResNeSt101-based method to train DeeplabV3+ [36]; the segmentation results on the PASCAL VOC 2012 validation set are shown in Fig 5. With the same segmentation network, Fig 5 shows that our DeeplabV3+ achieves high-quality segmentation results in different scenarios even though we do not use any saliency label supervision during training. In complex scenes in particular, the results of PuzzleCam often fall into misjudgment in some ambiguous regions. For example, our segmentation results can accurately determine the boundary range.
Comparison with mainstream methods. To analyze the superiority of our method more clearly, the prediction accuracy of each class and the background is compared with other mainstream methods in terms of mIoU on the PASCAL VOC 2012 val dataset. In Table 3, the highest values are in bold. For some easily confused classes, such as bicycles, horses, and motorcycles, which often appear in the same scene as people, our method achieves more accurate prediction results, which is essentially due to the correct long-range dependencies learned by the network through our method. However, classes such as birds and ships, which usually occupy a small area, are difficult to match with highly similar foreground classes in the batch after blocking, so our prediction accuracy for them is low.
The proposed method is compared with other methods on the PASCAL VOC 2012 test dataset. As seen from Table 4, the segmentation model trained with our pseudo-labels also achieves superior results and excellent generalization ability on the test set. Among the compared methods, SEC [10] and AffinityNet [19] also improve the prediction of CAM through the context relationship between pixels, and our segmentation results improve on them by 11.8% and 2.1%, respectively. Unlike AffinityNet [19], the method of generating CAM in this paper is end-to-end, and no other network is used to refine the CAM further. We compare our trained DeeplabV3+ with other current mainstream semantic segmentation methods trained on image-level labels and saliency labels. For a fair comparison, and following the established process used in previous work, Random Walk (RW) [19] and dense conditional random field (CRF) [46] are used in this experiment to further refine the generated pseudo-labels. Table 5 shows that our proposed method achieves the highest mIoU on both the val set and the test set of PASCAL VOC 2012. Among the compared methods, CIAN [33] also learns long-range dependencies across images. Compared with CIAN [33], the proposed method improves the mIoU on the validation set and test set by 0.6% and 0.5%, respectively. Compared with MDC [25], the proposed method improves the mIoU by 4.5% and 5.0% on the val and test sets, respectively, with weaker annotations.
We further evaluate the performance of our model on MS COCO 2014. Although pixel-level annotations are available, we solely use image-level class labels during training. It should be noted that, to reduce computational costs, we train on a subset of the training images, specifically 50% (40k) of the images. Table 6 compares our approach with current WSSS methods with image-level supervision on the COCO dataset. We can observe that our method achieves an mIoU score of 34.2% on the val set, outperforming all the competitors.
The above visual experimental results show that the proposed pseudo-label generation method has a more accurate region mask than similar methods.The segmentation network trained with our pseudo-labels achieves the highest prediction accuracy, effectively narrowing the gap between weakly supervised and fully supervised methods.

Ablation experiment
To verify the independent validity of the two modules, ResNet50 and ResNeSt101 were used as the backbone networks for analysis. As seen from Table 7, when the CFC module is added to the baseline, the mIoU of the pseudo-labels improves by 2.57%. When the RSA module is added to the baseline, the mIoU of the pseudo-labels improves by 4.45%. The best mIoU is achieved when the two modules are used together, with a 5.22% improvement over the baseline.
The experimental results show that the two modules proposed in this paper effectively improve the pseudo-label quality, and the effect is best when they are combined. We visualize the CAMs obtained by combining different modules. Fig 6 shows the CAMs generated with the different modules, and the CAMs shown are the set of all class predictions. It can be seen that baseline CAMs tend to be limited to regions with salient object features, such as the wheels of a motorcycle. When the CFC module is added, the prediction region of the CAMs expands from the salient regions of the target to other regions. When the RSA module is added, the false activation area of the CAMs is visibly reduced. The CAMs have richer detail and more accurate activation when CFC and RSA are used together. This benefits from the more reliable pixel semantic information mined through the two modules designed in this paper.
To better understand how our method effectively mines more pixel-level semantic information, we sample the generated CAMs at different iterations. A higher mIoU means that the pseudo-labels have higher overall prediction accuracy. Combining the two modules further improves the quality of the pseudo-labels. The main performance gains come from the effectiveness of CFC and RSA and from their cooperation, in which the correct long-range dependencies are learned both within and across samples. It can be seen that, as the number of training rounds increases, the supervision information we add effectively extends the activation area of the CAMs from the salient local area of the target to other regions. The CAMs have smoother and more complete boundaries, and some targets normally ignored by the network, such as the human body and the chair, are also activated after training.
The above ablation experiments show that each of the two core modules in the proposed framework effectively improves the quality of pseudo-labels generated from image-level labels when evaluated separately, and that higher pseudo-label accuracy is achieved when the two modules are combined. The effectiveness of the additional supervision incorporated in this paper is demonstrated by visualizing the accuracy and expansion trend of the CAMs during training. This supervision reduces the fragmented activation of CAM caused by the classification task. Overall, the experiments show that, through two modules embedded in the classification network, we successfully mined richer semantic information and greatly improved the practicality of weakly supervised learning.

Conclusions
In this paper, a novel weakly supervised semantic segmentation framework is proposed. We extend the CAMs generated by the classification network using long-range dependencies. We propose the cross-image foreground feature contrast module and the regional self-attention module, which take into account both inter-sample relationships and the information confusion arising from such dependencies. The results demonstrate that these two modules effectively extract more semantic information and more accurate target regions, resulting in CAMs with expanded coverage over the entire target area and fewer false predictions. The method enhances the precision of pseudo-labels for semantic segmentation networks. However, the accuracy on small objects when generating pseudo-labels still needs to be improved.

Fig 1. (a) Causal intervention in the sample caused by long-range dependencies. The same class is represented in the same color, where A ↔ D and A1 ↔ D1 represent false causal relations between regions of different classes. (b) Modeling of long-range dependencies between regional samples. The long-range dependencies of the same classes in different samples (A ↔ A1, D ↔ D1) are established, and the false causal relations between different classes are cut off. Each class captures long-range dependencies through pixel dependencies in a smaller area. https://doi.org/10.1371/journal.pone.0288596.g001

Fig 2. We propose a two-step network structure. The CAMs in the figure result from the visualization of all the classes. The Ĉ* generated by RSA is corrected by the cross-image comparison module and then restored to the size of C by the Merge module, and the C generated by step 2 provides equivariant constraint supervision for the merged version Ĉ*_re. https://doi.org/10.1371/journal.pone.0288596.g002

Fig 3. The two modules applied in step 1 of Fig 2 are described in detail. Reshape represents a tensor size transformation, and the product symbol stands for the matrix dot product operation. RSA module: the calculated similarity matrix of size MN × MN is normalized, and the refined CAM Ĉ* is obtained by weighting the original CAM Ĉ. CFC module: the foreground vector is formed by matrix multiplication of ŷ_i and ĉ*_i. https://doi.org/10.1371/journal.pone.0288596.g003

Fig 4. Pseudo masks on the PASCAL VOC 2012 train dataset. From top to bottom: original images; ground truth; the prediction results of PuzzleCam; the prediction results of our method. https://doi.org/10.1371/journal.pone.0288596.g004

Fig 5. Pseudo masks on the PASCAL VOC 2012 val dataset. From top to bottom: original images; ground truth; the prediction results of PuzzleCam; the prediction results of our method. https://doi.org/10.1371/journal.pone.0288596.g005

Fig 6. Ablation experiment of CAM. ResNet50 was used as the backbone network for analysis. The CAMs in the figure are generated on the PASCAL VOC 2012 training dataset. https://doi.org/10.1371/journal.pone.0288596.g006
Fig 7 visualizes the decrease in loss and the improvement in model accuracy as the number of iterations increases during training. When the CFC and RSA modules are added using the same backbone, the experimental results show that the pseudo-labels achieve lower mFDR and mFNR, indicating that the generated CAMs cover more of the target area and contain fewer false predictions.

Fig 8 visualizes the CAMs generated by the final version (c) under different training rounds.
In step 1, the Chunk module splits each image into four blocks, x_i = [x_i^(1), x_i^(2), x_i^(3), x_i^(4)], and the CAM set Ĉ = {ĉ_i}_{i=1}^{Q} is then generated from the blocks x_i^(j), where ĉ_i = [ĉ_i^(1), ĉ_i^(2), ĉ_i^(3), ĉ_i^(4)]. The feature maps of step 1 are ŷ_i = [ŷ_i^(1), ŷ_i^(2), ŷ_i^(3), ŷ_i^(4)] with ŷ_i^(j) ∈ R^{M×N×D}, and the feature maps of step 2 are y_i ∈ R^{H×W×D}, where D represents the number of channels, and M×N and H×W represent the spatial sizes.

Table 5. Comparison of our proposed method and existing state-of-the-art methods on the PASCAL VOC 2012 val and test sets. I, image-level labels; S, saliency labels.